Minimax Rates of Estimation for Sparse PCA in High Dimensions
Abstract



We study sparse principal components analysis in the high-dimensional setting, where $p$ (the number of variables) can be much larger than $n$ (the number of observations). We prove optimal, non-asymptotic lower and upper bounds on the minimax estimation error for the leading eigenvector when it belongs to an $\ell_q$ ball for $q \in [0,1]$. Our bounds are sharp in $p$ and $n$ for all $q \in [0,1]$ over a wide class of distributions. The upper bound is obtained by analyzing the performance of $\ell_q$-constrained PCA. In particular, our results provide convergence rates for $\ell_1$-constrained PCA.



1 Introduction 

High-dimensional data problems, where the number of 
variables p exceeds the number of observations n, are 
pervasive in modern applications of statistical infer- 
ence and machine learning. Such problems have in- 
creased the necessity of dimensionality reduction for 
both statistical and computational reasons. In some 
applications, dimensionality reduction is the end goal, 
while in others it is just an intermediate step in the 
analysis stream. In either case, dimensionality reduction is
usually data-dependent, and so the limited sample size and
noise may have an adverse effect. Principal components
analysis (PCA) is perhaps one of the most well known and
widely used techniques for unsupervised dimensionality
reduction. However, in the high-dimensional situation,
where $p/n$ does not tend to 0 as $n \to \infty$, PCA may not give
consistent estimates of eigenvalues and eigenvectors of the
population covariance matrix [12]. To remedy this situation,
sparsity constraints on estimates of the leading eigenvectors
have been proposed and shown to perform well
in various applications. In this paper we prove opti- 
mal minimax error bounds for sparse PCA when the 
leading eigenvector is sparse. 

1.1 Subspace Estimation 

Suppose we observe i.i.d. random vectors $X_i \in \mathbb{R}^p$, $i = 1, \dots, n$ and we wish to reduce the dimension of the data from $p$ down to $k$. PCA looks for $k$ uncorrelated, linear combinations of the $p$ variables that have maximal variance. This is equivalent to finding a $k$-dimensional linear subspace whose orthogonal projection $A$ minimizes the mean squared error

$$\operatorname{mse}(A) = \mathbb{E}\|(X_i - \mathbb{E}X_i) - A(X_i - \mathbb{E}X_i)\|_2^2 \qquad (1)$$



[see 10, Chapter 7.2.3 for example]. The optimal sub- 
space is determined by spectral decomposition of the 
population covariance matrix 



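$$\Sigma = \mathbb{E}X_iX_i^T - (\mathbb{E}X_i)(\mathbb{E}X_i)^T = \sum_{j=1}^{p} \lambda_j\theta_j\theta_j^T \qquad (2)$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ are the eigenvalues and $\theta_1, \dots, \theta_p \in \mathbb{R}^p$, orthonormal, are eigenvectors of $\Sigma$. If $\lambda_k > \lambda_{k+1}$, then the optimal $k$-dimensional linear subspace is the span of $\Theta = (\theta_1, \dots, \theta_k)$ and its projection is given by $\Pi = \Theta\Theta^T$. Thus, if we know $\Sigma$ then we may optimally (in the sense of eq. (1)) reduce the dimension of the data from $p$ to $k$ by the mapping $x \mapsto \Theta\Theta^T x$.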

In practice, $\Sigma$ is not known and so $\Theta$ must be estimated from the data. In that case we replace $\Theta$ by an estimate $\hat\Theta$ and reduce the dimension of the data by the mapping $x \mapsto \hat\Pi x$, where $\hat\Pi = \hat\Theta\hat\Theta^T$. PCA uses the spectral decomposition of the sample covariance matrix

$$S = \frac{1}{n}\sum_{i=1}^{n} X_iX_i^T - \bar X\bar X^T = \sum_{j=1}^{p \wedge n} l_j u_j u_j^T,$$

where $\bar X$ is the sample mean, and $l_j$ and $u_j$ are eigenvalues and eigenvectors of $S$ defined analogously to eq. (2). It reduces the dimension of the data to $k$ by the mapping $x \mapsto UU^T x$, where $U = (u_1, \dots, u_k)$.
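As a minimal sketch of the sample-covariance construction above, the following NumPy snippet forms $S$, extracts the $k$ leading eigenvectors, and builds the projection $UU^T$. The function name `pca_projection` and the synthetic data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def pca_projection(X, k):
    """Return (U, Pi_hat): top-k eigenvectors of the sample covariance and U U^T."""
    Xc = X - X.mean(axis=0)               # subtract the sample mean
    S = Xc.T @ Xc / X.shape[0]            # equals (1/n) sum X_i X_i^T - Xbar Xbar^T
    evals, evecs = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
    U = evecs[:, ::-1][:, :k]             # reorder so the largest eigenvalues come first
    return U, U @ U.T

# Hypothetical data: n = 50 observations of p = 5 variables, first variable dominant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([3.0, 1.0, 1.0, 1.0, 1.0])
U, Pi_hat = pca_projection(X, k=2)
```

The matrix `Pi_hat` is an orthogonal projection of rank $k$, so it is idempotent and has trace $k$ regardless of the data.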

In the classical regime where $p$ is fixed and $n \to \infty$, PCA is a consistent estimator of the population eigenvectors. However, this scaling is not appropriate for modern applications where $p$ is comparable to or larger than $n$. In that case, it has been observed [18, 17, 12] that if $p, n \to \infty$ and $p/n \to c > 0$, then PCA can be an inconsistent estimator in the sense that the angle between $u_1$ and $\theta_1$ can remain bounded away from 0 even as $n \to \infty$.

1.2 Sparsity Constraints 

Estimation in high-dimensions may be beyond hope 
without additional structural constraints. In addition 
to making estimation feasible, these structural con- 
straints may also enhance interpretability of the es- 
timators. One important example of this is sparsity. 
The notion of sparsity is that a few variables have large 
effects, while most others are negligible. This type of 
assumption is often reasonable in applications and is 
now widespread in high-dimensional statistical infer- 
ence. 

Many researchers have proposed sparsity constrained 
versions of PCA along with practical algorithms, and 
research in this direction continues to be very active 
[e.g., 13, 27, 6, 21, 25]. Some of these works are based
on the idea of adding an $\ell_1$ constraint to the estimation
scheme. For instance, Jolliffe, Trendafilov, and Uddin
[13] proposed adding an $\ell_1$ constraint to the variance
maximization formulation of PCA. Others have proposed
convex relaxations of the "hard" $\ell_0$-constrained
form of PCA [6]. Nearly all of these proposals are based
on an iterative approach where the eigenvectors are
estimated in a one-at-a-time fashion with some sort
of deflation step in between [14]. For this reason, we
consider the basic problem of estimating the leading
population eigenvector $\theta_1$.

The $\ell_q$ balls for $q \in [0,1]$ provide an appealing way to make the notion of sparsity concrete. These sets are defined by

$$\mathbb{B}_q^p(R_q) = \Big\{\theta \in \mathbb{R}^p : \sum_{i=1}^{p} |\theta_i|^q \le R_q\Big\}$$

and

$$\mathbb{B}_0^p(R_0) = \Big\{\theta \in \mathbb{R}^p : \sum_{i=1}^{p} 1_{\{\theta_i \ne 0\}} \le R_0\Big\}.$$

The case $q = 0$ corresponds to "hard" sparsity where $R_0$ is the number of nonzero entries of the vectors. For $q > 0$ the $\ell_q$ balls capture "soft" sparsity where a few of the entries of $\theta$ are large, while most are small. The soft sparsity case may be more realistic for applications where the effects of many variables may be very small, but still nonzero.
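To make the two definitions concrete, here is a small sketch that evaluates $\sum_i |\theta_i|^q$ (or the count of nonzeros when $q = 0$) and tests membership in the ball. The helper names `lq_size` and `in_lq_ball` and the example vectors are assumptions made for illustration.

```python
def lq_size(theta, q):
    """sum_i |theta_i|^q for q in (0, 1]; number of nonzero entries for q = 0."""
    if q == 0:
        return sum(1 for t in theta if t != 0)
    return sum(abs(t) ** q for t in theta)

def in_lq_ball(theta, q, radius):
    """True when theta lies in the l_q ball of the given radius."""
    return lq_size(theta, q) <= radius

hard = [0.6, 0.8, 0.0, 0.0]          # "hard" sparsity: two nonzero entries
soft = [0.99, 0.1, 0.05, 0.05]       # "soft" sparsity: one large entry, several small ones
```

The vector `soft` has four nonzero entries, so it falls outside a small $\ell_0$ ball, yet its $\ell_1$ size is only about 1.19.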



1.3 Minimax Framework and 
High-Dimensional Scaling 

In this paper, we use the statistical minimax framework to elucidate the difficulty/feasibility of estimation when the leading eigenvector $\theta_1$ is assumed to belong to $\mathbb{B}_q^p(R_q)$ for $q \in [0,1]$. The framework can make clear the fundamental limitations of statistical inference that any estimator $\hat\theta_1$ must satisfy. Thus, it can reveal gaps between optimal estimators and computationally tractable ones, and also indicate when practical algorithms achieve the fundamental limits.

Parameter space There are two main ingredients in the minimax framework. The first is the class of probability distributions under consideration. These are usually associated with some parameter space corresponding to the structural constraints. Formally, suppose that $\lambda_1 > \lambda_2$. Then we may write eq. (2) as

$$\Sigma = \lambda_1\theta_1\theta_1^T + \lambda_2\Sigma_0, \qquad (3)$$

where $\lambda_1 > \lambda_2 \ge 0$, $\theta_1 \in \mathbb{S}_2^{p-1}$ (the unit sphere of $\ell_2$), $\Sigma_0 \succeq 0$, $\Sigma_0\theta_1 = 0$, and $\|\Sigma_0\|_2 = 1$ (the spectral norm of $\Sigma_0$). In model (3), the covariance matrix $\Sigma$ has a unique largest eigenvalue $\lambda_1$. Throughout this paper, for $q \in [0,1]$, we consider the class

$$\mathcal{M}_q(\lambda_1, \lambda_2, \bar R_q, \alpha, \kappa)$$

that consists of all probability distributions on $X_i \in \mathbb{R}^p$, $i = 1, \dots, n$ satisfying model (3) with $\theta_1 \in \mathbb{B}_q^p(\bar R_q + 1)$, and Assumption 2.1 (below) with $\alpha$ and $\kappa$ depending on $q$ only.

Loss function The second ingredient in the minimax framework is the loss function. In the case of subspace estimation, an obvious criterion for evaluating the quality of an estimator $\hat\Theta$ is the squared distance between $\hat\Theta$ and $\Theta$. However, it is not appropriate because $\Theta$ is not unique: $\Theta$ and $\Theta V$ span the same subspace for any $k \times k$ orthogonal matrix $V$. On the other hand, the orthogonal projections $\Pi = \Theta\Theta^T$ and $\hat\Pi = \hat\Theta\hat\Theta^T$ are unique. So we consider the loss function defined by the Frobenius norm of their difference:

$$\|\hat\Pi - \Pi\|_F.$$

In the case where $k = 1$, the only possible non-uniqueness in the leading eigenvector is its sign ambiguity. Still, we prefer to use the above loss function in the form

$$\|\hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\|_F$$

because it generalizes to the case $k > 1$. Moreover, when $k = 1$, it turns out to be equivalent to both the Euclidean distance between $\hat\theta_1$, $\theta_1$ (when they belong to the same half-space) and the magnitude of the sine of the angle between $\hat\theta_1$, $\theta_1$. (See Lemmas A.1.1 and A.1.2 in the Appendix.)
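The equivalences claimed for $k = 1$ can be checked numerically. The sketch below, with an illustrative helper `projection_loss` and an arbitrary angle of 0.3 radians (both assumptions), compares the projection loss with $\sqrt{2}\,|\sin|$ of the angle and confirms that flipping the sign of the estimate does not change the loss.

```python
import numpy as np

def projection_loss(a, b):
    """Frobenius norm of a a^T - b b^T for unit vectors a and b."""
    return np.linalg.norm(np.outer(a, a) - np.outer(b, b), ord="fro")

theta = np.array([1.0, 0.0, 0.0])
angle = 0.3                                          # angle between the two unit vectors
theta_hat = np.array([np.cos(angle), np.sin(angle), 0.0])

loss = projection_loss(theta_hat, theta)
loss_flipped = projection_loss(-theta_hat, theta)    # sign ambiguity does not affect the loss
sine_form = np.sqrt(2.0) * abs(np.sin(angle))        # sqrt(2) * |sin(angle)|
euclid = np.linalg.norm(theta_hat - theta)           # Euclidean distance, same half-space
```

Here `loss` and `sine_form` agree up to rounding, and `euclid**2 <= loss**2 <= 2 * euclid**2`, which is the two-sided comparison stated in Lemma A.1.2.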

Scaling Our goal in this work is to provide non-asymptotic bounds on the minimax error

$$\min_{\hat\theta_1}\ \max_{\mathbb{P} \in \mathcal{M}_q(\lambda_1, \lambda_2, \bar R_q, \alpha, \kappa)} \mathbb{E}_{\mathbb{P}}\|\hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\|_F,$$

where the minimum is taken over all estimators that depend only on $X_1, \dots, X_n$. These bounds explicitly track the dependence of the minimax error on the vector $(p, n, \lambda_1, \lambda_2, \bar R_q)$. As we stated earlier, the classical $p$ fixed, $n \to \infty$ scaling completely misses the effect of high-dimensionality; we, on the other hand, want to highlight the role that sparsity constraints play in high-dimensional estimation. Our lower bounds on the minimax error use an information theoretic technique based on Fano's Inequality. The upper bounds are obtained by constructing an $\ell_q$-constrained estimator that nearly achieves the lower bound.

1.4 $\ell_q$-Constrained Eigenvector Estimation

Consider the constrained maximization problem

$$\text{maximize } b^TSb, \quad \text{subject to } b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_q^p(\rho_q) \qquad (4)$$

and the estimator defined to be the solution of the optimization problem. The feasible set is non-empty when $\rho_q \ge 1$, and the $\ell_q$ constraint is active only when $\rho_q \le p^{1-q/2}$. The $\ell_q$-constrained estimator corresponds to ordinary PCA when $q = 2$ and $\rho_q = 1$. When $q \in [0,1]$, the $\ell_q$ constraint promotes sparsity in the estimate. Since the criterion is a convex function of $b$, the convexity of the constraint set is inconsequential: it may be replaced by its convex hull without changing the optimum.

The case q = 1 is the most interesting from a practical 
point of view, because it corresponds to the well-known 
Lasso estimator for linear regression. In this case, 
eq. (4) coincides with the method proposed by Jol- 
liffe, Trendafilov, and Uddin [13], though (4) remains 
a difficult convex maximization problem. Subsequent 
authors [21, 25] have proposed efficient algorithms that 
can approximately solve eq. (4). Our results below are 
(to our knowledge) the first convergence rate results 
available for this $\ell_1$-constrained PCA estimator.
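For intuition about problem (4) in the hard-sparsity case $q = 0$, the following is a brute-force sketch that enumerates every support of size $s$ and takes the leading eigenvector of the corresponding submatrix. It is only an illustration under stated assumptions: the function name `l0_constrained_pca` is hypothetical, the search is exponential in $p$, and it is not one of the efficient algorithms cited above. For a deterministic example the population covariance of a spiked model is used in place of $S$.

```python
import itertools
import numpy as np

def l0_constrained_pca(S, s):
    """Exhaustive search: maximize b^T S b over unit vectors with at most s nonzeros."""
    p = S.shape[0]
    best_val, best_b = -np.inf, None
    for support in itertools.combinations(range(p), s):
        idx = list(support)
        # On a fixed support the optimum is the top eigenpair of the submatrix.
        evals, evecs = np.linalg.eigh(S[np.ix_(idx, idx)])
        if evals[-1] > best_val:
            best_val = evals[-1]
            best_b = np.zeros(p)
            best_b[idx] = evecs[:, -1]
    return best_b, best_val

# Hypothetical spiked covariance with a 2-sparse leading eigenvector.
theta1 = np.zeros(6)
theta1[[0, 1]] = 1 / np.sqrt(2)
Sigma = 4.0 * np.outer(theta1, theta1) + 1.0 * (np.eye(6) - np.outer(theta1, theta1))
b_hat, value = l0_constrained_pca(Sigma, s=2)
```

In this example the search recovers the planted support $\{0, 1\}$ and the leading eigenvalue 4, with `b_hat` equal to `theta1` up to sign.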

1.5 Related Work 

Amini and Wainwright [1] analyzed the performance of a semidefinite programming (SDP) formulation of sparse PCA for a generalized spiked covariance model [11]. Their model assumes that the nonzero entries of the eigenvector all have the same magnitude, and that the covariance matrix corresponding to the nonzero entries is of the form $\beta\theta_1\theta_1^T + I$. They derived upper and lower bounds on the success probability for model selection under the constraint that $\theta_1 \in \mathbb{B}_0(R_0)$. Their upper bound is conditional on the SDP based estimate being rank 1. Model selection accuracy and estimation accuracy are different notions of accuracy. One does not imply the other. In comparison, our results below apply to a wider class of covariance matrices and in the case of $\ell_0$ we provide sharp bounds for the estimation error.

Operator norm consistent estimates of the covariance 
matrix automatically imply consistent estimates of 
eigenspaces. This follows from matrix perturbation 
theory [see, e.g., 22]. There has been much work on 
finding operator norm consistent covariance estimators 
in high-dimensions under assumptions on the sparsity 
or bandability of the entries of $\Sigma$ or $\Sigma^{-1}$ [see, e.g., 3,
2, 7]. Minimax results have been established in that 
setting by Cai, Zhang, and Zhou [5]. However, spar- 
sity in the covariance matrix and sparsity in the lead- 
ing eigenvector are different conditions. There is some 
overlap (e.g. the spiked covariance model), but in gen- 
eral, one does not imply the other. 

Raskutti, Wainwright, and Yu [20] studied the related 
problem of minimax estimation for linear regression 
over $\ell_q$ balls. Remarkably, the rates that we derive
for PCA are nearly identical to those for the Gaussian 
sequence model and regression. The work of Raskutti, 
Wainwright, and Yu [20] is close to ours in that they 
inspired us to use some similar techniques for the up- 
per bounds. 

While writing this paper we became aware of an un- 
published manuscript by Paul and Johnstone [19]. 
They also study PCA under $\ell_q$ constraints with a slightly different but equivalent loss function. Their work provides asymptotic lower bounds for the minimax rate of convergence over $\ell_q$ balls for $q \in (0,2]$. They also analyze the performance of an estimator based on a multistage thresholding procedure and show that asymptotically it nearly attains the optimal rate of convergence. Their analysis used spiked covariance matrices (corresponding to $\lambda_2\Sigma_0 = \lambda_2(I_p - \theta_1\theta_1^T)$ in eq. (3) when $k = 1$), while we allow a more general class of covariance matrices. We note that our work provides non-asymptotic bounds that are optimal over $(p, n, R_q)$ when $q \in \{0, 1\}$ and optimal over $(p, n)$ when $q \in (0,1)$.

In the next section, we present our main results along with some additional conditions to guarantee that estimation over $\mathcal{M}_q$ remains non-trivial. The main steps of the proofs are in Section 3. In the proofs we state some auxiliary lemmas. They are mainly technical, so we defer their proofs to the Appendix. Section 4 concludes the paper with some comments on extensions of this work.

2 Main Results 

Our minimax results are formulated in terms of non-asymptotic bounds that depend explicitly on $(n, p, R_q, \lambda_1, \lambda_2)$. To facilitate presentation, we introduce the notations

$$\bar R_q = R_q - 1 \quad\text{and}\quad \sigma^2 = \frac{\lambda_1\lambda_2}{(\lambda_1 - \lambda_2)^2}.$$

$\bar R_q$ appears naturally in our lower bounds because the eigenvector $\theta_1$ belongs to the sphere of dimension $p - 1$ due to the constraint that $\|\theta_1\|_2 = 1$. Intuitively, $\sigma^2$ plays the role of the effective noise-to-signal ratio. When comparing with minimax results for linear regression over $\ell_q$ balls, $\sigma^2$ is exactly analogous to the noise variance in the linear model. Throughout the paper, there are absolute constants $c$, $C$, $c_1$, etc. that may take different values in different expressions.

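The following assumption on $R_q$, the size of the $\ell_q$ ball, is to ensure that the eigenvector is not too dense.

Assumption 2.1. There exists $\alpha \in (0,1]$, depending only on $q$, such that

$$\bar R_q \le \kappa^q (p-1)^{1-\alpha}\,\bar R_q^{2\alpha/(2-q)} \left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{q/2}, \qquad (5)$$

where $\kappa \le c\alpha/16$ is a constant depending only on $q$, and

$$1 \le \bar R_q \le e^{-1}(p-1)^{1-q/2}. \qquad (6)$$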

Assumption 2.1 also ensures that the effective noise $\sigma^2$ is not too small; this may happen if the spectral gap $\lambda_1 - \lambda_2$ is relatively large or if $\lambda_2$ is relatively close to 0. In either case, the distribution of $X_i/\lambda_1^{1/2}$ would concentrate on a 1-dimensional subspace and the problem would effectively degrade into a low-dimensional one. If $R_q$ is relatively large, then $\mathbb{S}_2^{p-1} \cap \mathbb{B}_q^p(R_q)$ is not much smaller than $\mathbb{S}_2^{p-1}$ and the parameter space will include many non-sparse vectors. In the case $q = 0$, Assumption 2.1 simplifies because we may take $\alpha = 1$ and only require that

$$1 \le \bar R_0 \le e^{-1}(p-1).$$



for some $\alpha' \in [0,1]$, is sufficient to ensure that (5) holds for $q \in (0,1]$. Alternatively, if we let $\alpha = 1 - q/2$ then (5) is satisfied for $q \in (0,1]$ if

$$1 \le \kappa^2\sigma^2\,\frac{p-1}{n}\,\log\frac{p-1}{\bar R_q^{2/(2-q)}}.$$



The relationship between $n$, $p$, $R_q$ and $\sigma^2$ described in Assumption 2.1 indicates a regime in which the inference is neither impossible nor trivially easy. We can now state our first main result.

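Theorem 2.1 (Lower Bound for Sparse PCA). Let $q \in [0,1]$. If Assumption 2.1 holds, then there exists a universal constant $c > 0$ depending only on $q$, such that every estimator $\hat\theta_1$ satisfies

$$\max_{\mathbb{P} \in \mathcal{M}_q(\lambda_1, \lambda_2, \bar R_q, \alpha, \kappa)} \mathbb{E}_{\mathbb{P}}\|\hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\|_F \ge c\,\min\left\{1,\ \bar R_q^{1/2}\left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{\frac12 - \frac q4}\right\}.$$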



Our proof of Theorem 2.1 is given in Section 3.1. It follows the usual nonparametric lower bound framework. The main challenge is to construct a rich packing set in $\mathbb{S}_2^{p-1} \cap \mathbb{B}_q^p(R_q)$. (See Lemma 3.1.2.) We note that a similar construction has been independently developed and applied in a similar context by Paul and Johnstone [19].
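To see how the lower bound scales, the sketch below evaluates the expression $\min\{1, \bar R_q^{1/2}[(\sigma^2/n)\log((p-1)/\bar R_q^{2/(2-q)})]^{1/2-q/4}\}$ with the unknown constant $c$ omitted. The function name `lower_bound_rate` and the parameter values are assumptions chosen for illustration; the formula is the rate as read from Theorem 2.1 and the end of its proof.

```python
import math

def lower_bound_rate(p, n, R_bar, lam1, lam2, q):
    """Rate in the minimax lower bound, up to the unspecified constant c."""
    sigma2 = lam1 * lam2 / (lam1 - lam2) ** 2          # effective noise-to-signal ratio
    log_term = math.log((p - 1) / R_bar ** (2.0 / (2.0 - q)))
    if log_term <= 0:
        raise ValueError("R_bar is too large relative to p for this expression")
    inner = sigma2 / n * log_term
    return min(1.0, math.sqrt(R_bar) * inner ** (0.5 - q / 4.0))

# Hypothetical setting: p = 1000 variables, hard sparsity (q = 0), R_bar = 5.
rate_small_n = lower_bound_rate(p=1000, n=100, R_bar=5, lam1=2.0, lam2=1.0, q=0)
rate_large_n = lower_bound_rate(p=1000, n=10000, R_bar=5, lam1=2.0, lam2=1.0, q=0)
```

For $q = 0$ the rate decays like $n^{-1/2}$, so multiplying $n$ by 100 divides it by 10; for very small $n$ the bound saturates at 1.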

Our upper bound result is based on analyzing the solution to the $\ell_q$-constrained maximization problem (4), which is a special case of empirical risk minimization. In order to bound the empirical process, we assume the data vector has sub-Gaussian tails, which is nicely described by the Orlicz $\psi_\alpha$-norm.

Definition 2.1. For a random variable $Y \in \mathbb{R}$, the Orlicz $\psi_\alpha$-norm is defined for $\alpha \ge 1$ as

$$\|Y\|_{\psi_\alpha} = \inf\{c > 0 : \mathbb{E}\exp(|Y/c|^\alpha) \le 2\}.$$

Random variables with finite $\psi_\alpha$-norm correspond to those whose tails are bounded by $\exp(-Cx^\alpha)$.

The case $\alpha = 2$ is important because it corresponds to random variables with sub-Gaussian tails. For example, if $Y \sim \mathcal{N}(0, \sigma^2)$ then $\|Y\|_{\psi_2} \le C\sigma$ for some positive constant $C$. See [24, Chapter 2.2] for a complete introduction.
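As a worked instance of the definition, the $\psi_2$-norm of a Gaussian can be computed from the closed form $\mathbb{E}\exp(Y^2/c^2) = (1 - 2\sigma^2/c^2)^{-1/2}$, valid for $c^2 > 2\sigma^2$. The sketch below finds the smallest admissible $c$ by bisection; the function names are illustrative assumptions. The answer, $\sigma\sqrt{8/3}$, is consistent with the value $K^2 = 8/3$ quoted for the multivariate Gaussian.

```python
import math

def gaussian_psi2_moment(sigma, c):
    """Closed form of E exp(Y^2 / c^2) for Y ~ N(0, sigma^2); infinite unless c^2 > 2 sigma^2."""
    ratio = 2.0 * sigma ** 2 / c ** 2
    return math.inf if ratio >= 1.0 else 1.0 / math.sqrt(1.0 - ratio)

def gaussian_psi2_norm(sigma, tol=1e-12):
    """Bisection for the smallest c with E exp(Y^2 / c^2) <= 2."""
    lo, hi = sigma, 10.0 * sigma          # the moment is infinite at lo and below 2 at hi
    while hi - lo > tol * sigma:
        mid = 0.5 * (lo + hi)
        if gaussian_psi2_moment(sigma, mid) <= 2.0:
            hi = mid                      # mid is admissible, so shrink from above
        else:
            lo = mid
    return hi

norm_std = gaussian_psi2_norm(1.0)        # psi_2 norm of a standard Gaussian
```

Solving $(1 - 2\sigma^2/c^2)^{-1/2} = 2$ by hand gives $c^2 = 8\sigma^2/3$, which the bisection reproduces.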

Assumption 2.2. There exist i.i.d. random vectors $Z_1, \dots, Z_n \in \mathbb{R}^p$ such that $\mathbb{E}Z_i = 0$, $\mathbb{E}Z_iZ_i^T = I_p$,



In the high-dimensional case that we are interested, 
where p > n, the condition that 



$$X_i = \mu + \Sigma^{1/2}Z_i \quad\text{and}\quad \sup_{x \in \mathbb{S}_2^{p-1}} \|\langle Z_i, x\rangle\|_{\psi_2} \le K,$$



l<R q < e~ L K q cr q p 



1 KlrrlJ 1 - a ">/ 2 



where $\mu \in \mathbb{R}^p$ and $K > 0$ is a constant.
Assumption 2.2 holds for a variety of distributions, including
the multivariate Gaussian (with $K^2 = 8/3$)
and those of bounded random vectors. Under this as- 
sumption, we have the following theorem. 

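Theorem 2.2 (Upper Bound for Sparse PCA). Let $\hat\theta_1$ be the $\ell_q$-constrained PCA estimate in eq. (4) with $\rho_q = R_q$, and let

$$\epsilon = \|\hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\|_F \quad\text{and}\quad \hat\sigma = \lambda_1/(\lambda_1 - \lambda_2).$$

If the distribution of $(X_1, \dots, X_n)$ belongs to $\mathcal{M}_q(\lambda_1, \lambda_2, \bar R_q, \alpha, \kappa)$ and satisfies Assumptions 2.1 and 2.2, then there exists a constant $c > 0$ depending only on $K$ such that the following hold:

1. If $q \in (0,1)$, then

$$\mathbb{E}\epsilon^2 \le c\,\min\left\{1,\ R_q^2\left[\frac{\hat\sigma^2}{n}\log p\right]^{1-\frac q2}\right\}.$$

2. If $q = 1$, then

$$\mathbb{E}\epsilon^2 \le c\,\min\left\{1,\ R_1\left[\frac{\hat\sigma^2}{n}\log(p/R_1^2)\right]^{\frac12}\right\}.$$

3. If $q = 0$, then

$$[\mathbb{E}\epsilon]^2 \le c\,\min\left\{1,\ R_0\,\frac{\hat\sigma^2}{n}\log(p/R_0)\right\}.$$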

The proof of Theorem 2.2 is given in Section 3.2. The different bounds for $q = 0$, $q = 1$, and $q \in (0,1)$ are due to the different tools available for controlling empirical processes in $\ell_q$ balls. Comparing with Theorem 2.1, when $q = 0$, the lower and upper bounds agree up to a factor $\sqrt{\lambda_2/\lambda_1}$. In the cases of $q = 1$ and $q \in (0,1)$, a lower bound on the squared error can be obtained by using the fact $\mathbb{E}Y^2 \ge (\mathbb{E}Y)^2$. Therefore, over the class of distributions in $\mathcal{M}_q(\lambda_1, \lambda_2, \bar R_q, \alpha, \kappa)$ satisfying Assumptions 2.1 and 2.2, the upper and lower bounds agree in terms of $(p, n)$ for all $q \in (0,1)$, and are sharp in $(p, n, R_q)$ for $q \in \{0, 1\}$.

3 Proofs of Main Results 

We use the following notation in the proofs. For matrices $A$ and $B$ whose dimensions are compatible, we define $\langle A, B\rangle = \operatorname{Tr}(A^TB)$. Then the Frobenius norm is $\|A\|_F^2 = \langle A, A\rangle$. The Kullback-Leibler (KL) divergence between two probability measures $\mathbb{P}_1$, $\mathbb{P}_2$ is denoted by $D(\mathbb{P}_1\|\mathbb{P}_2)$.



3.1 Proof of the Lower Bound (Theorem 2.1) 

Our main tool for proving the minimax lower bound is 
the generalized Fano Method [9] . The following version 
is from [26, Lemma 3]. 

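Lemma 3.1.1 (Generalized Fano method). Let $N \ge 1$ be an integer and $\theta_1, \dots, \theta_N \subset \Theta$ index a collection of probability measures $\mathbb{P}_{\theta_i}$ on a measurable space $(\mathcal{X}, \mathcal{A})$. Let $d$ be a pseudometric on $\Theta$ and suppose that for all $i \ne j$

$$d(\theta_i, \theta_j) \ge \alpha_N$$

and

$$D(\mathbb{P}_{\theta_i}\|\mathbb{P}_{\theta_j}) \le \beta_N.$$

Then every $\mathcal{A}$-measurable estimator $\hat\theta$ satisfies

$$\max_i \mathbb{E}_{\theta_i} d(\hat\theta, \theta_i) \ge \frac{\alpha_N}{2}\left[1 - \frac{\beta_N + \log 2}{\log N}\right].$$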

The method works by converting the problem from 
estimation to testing by discretizing the parameter 
space, and then applying Fano's Inequality to the testing
problem. (The $\beta_N$ term that appears above is an
upper bound on the mutual information.)
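The bound in Lemma 3.1.1 is simple to evaluate once a packing is in hand. The sketch below, with a hypothetical helper `fano_lower_bound` and made-up values of $\alpha_N$, $\beta_N$ and $N$, shows that when $\beta_N \le \frac14\log N$ and $\log N \ge 4\log 2$ the bound is at least $\alpha_N/4$.

```python
import math

def fano_lower_bound(alpha_N, beta_N, N):
    """(alpha_N / 2) * (1 - (beta_N + log 2) / log N), the bound of Lemma 3.1.1."""
    if N < 2:
        raise ValueError("need at least two hypotheses")
    return 0.5 * alpha_N * (1.0 - (beta_N + math.log(2.0)) / math.log(N))

# Hypothetical packing: N = 2^16 points, separation 1, KL radius a quarter of log N.
N = 2 ** 16
bound = fano_lower_bound(alpha_N=1.0, beta_N=0.25 * math.log(N), N=N)
```

With these numbers the bracket equals $1 - 1/4 - 1/16$, so the bound is $0.34375$, above the guaranteed $\alpha_N/4 = 0.25$.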

To be successful, we must find a sufficiently large finite 
subset of the parameter space such that the points in 
the subset are $\alpha_N$-separated under the loss, yet nearly
indistinguishable under the KL divergence of the corre- 
sponding probability measures. We will use the subset 
given by the following lemma. 

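Lemma 3.1.2 (Local packing set). Let $\bar R_q = R_q - 1 \ge 1$ and $p \ge 5$. There exists a finite subset $\Theta_\epsilon \subset \mathbb{S}_2^{p-1} \cap \mathbb{B}_q^p(R_q)$ and an absolute constant $c > 0$ such that every distinct pair $\theta_1, \theta_2 \in \Theta_\epsilon$ satisfies

$$\epsilon/\sqrt2 < \|\theta_1 - \theta_2\|_2 \le \sqrt2\,\epsilon,$$

and

$$\log|\Theta_\epsilon| \ge c\left(\frac{\bar R_q}{\epsilon^q}\right)^{\frac{2}{2-q}}\left[\log(p-1) - \log\left(\frac{\bar R_q}{\epsilon^q}\right)^{\frac{2}{2-q}}\right]$$

for all $q \in [0,1]$ and $\epsilon \in (0,1]$.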

Fix $\epsilon \in (0,1]$ and let $\Theta_\epsilon$ denote the set given by Lemma 3.1.2. With Lemma A.1.2 we have

$$\epsilon^2/2 \le \|\theta_1\theta_1^T - \theta_2\theta_2^T\|_F^2 \le 4\epsilon^2 \qquad (7)$$

for all distinct pairs $\theta_1, \theta_2 \in \Theta_\epsilon$. For each $\theta \in \Theta_\epsilon$, let

$$\Sigma_\theta = (\lambda_1 - \lambda_2)\theta\theta^T + \lambda_2 I_p.$$

Clearly, $\Sigma_\theta$ has eigenvalues $\lambda_1 > \lambda_2 = \cdots = \lambda_p$. Then $\Sigma_\theta$ satisfies eq. (3). Let $\mathbb{P}_\theta$ denote the $n$-fold product of the $\mathcal{N}(0, \Sigma_\theta)$ probability measure. We use the following lemma to help bound the KL divergence.

Lemma 3.1.3. For $i = 1, 2$, let $x_i \in \mathbb{S}_2^{p-1}$, $\lambda_1 > \lambda_2 > 0$,

$$\Sigma_i = (\lambda_1 - \lambda_2)x_ix_i^T + \lambda_2 I_p,$$

and $\mathbb{P}_i$ be the $n$-fold product of the $\mathcal{N}(0, \Sigma_i)$ probability measure. Then

D(¥ 1 \\F 2 ) = ^Wxuf-XixUp, 

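where $\sigma^2 = \lambda_1\lambda_2/(\lambda_1 - \lambda_2)^2$.

Applying this lemma with eq. (7) gives

$$D(\mathbb{P}_{\theta_1}\|\mathbb{P}_{\theta_2}) \le \frac{2n\epsilon^2}{\sigma^2}.$$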



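Thus, we have found a subset of the parameter space that conforms to the requirements of Lemma 3.1.1, and so

$$\max_{\theta \in \Theta_\epsilon} \mathbb{E}_\theta\|\hat\theta_1\hat\theta_1^T - \theta\theta^T\|_F \ge \frac{\epsilon}{2\sqrt2}\left[1 - \frac{2n\epsilon^2/\sigma^2 + \log 2}{\log|\Theta_\epsilon|}\right]$$

for all $\epsilon \in (0,1]$. The final step is to choose $\epsilon$ of the correct order. If we can find $\epsilon$ so that

$$\frac{2n\epsilon^2/\sigma^2}{\log|\Theta_\epsilon|} \le \frac14 \qquad (8)$$

and

$$\log|\Theta_\epsilon| \ge 4\log 2, \qquad (9)$$

then we may conclude that

$$\max_{\theta \in \Theta_\epsilon} \mathbb{E}_\theta\|\hat\theta_1\hat\theta_1^T - \theta\theta^T\|_F \ge \frac{\epsilon}{4\sqrt2}.$$

For a constant $C \in (0,1)$ to be chosen later, let

$$\epsilon^2 = \min\left\{1,\ C^{2-q}\bar R_q\left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{1-\frac q2}\right\}. \qquad (10)$$

We consider each of the two cases in the above $\min\{\cdots\}$ separately.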



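Case 1: Suppose that

$$1 \le C^{2-q}\bar R_q\left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{1-\frac q2}. \qquad (11)$$

Then $\epsilon^2 = 1$ and by rearranging (11)

$$\frac{n}{C^2\sigma^2} \le \bar R_q^{\frac{2}{2-q}}\left[\log(p-1) - \log\bar R_q^{\frac{2}{2-q}}\right].$$

So by Lemma 3.1.2,

$$\log|\Theta_\epsilon| \ge c\,\bar R_q^{\frac{2}{2-q}}\left[\log(p-1) - \log\bar R_q^{\frac{2}{2-q}}\right] \ge \frac{c\,n}{C^2\sigma^2}.$$

If we choose $C^2 \le c/16$, then

$$\frac{2n\epsilon^2/\sigma^2}{\log|\Theta_\epsilon|} \le \frac{4C^2}{c} \le \frac14.$$

To lower bound $\log|\Theta_\epsilon|$, observe that the function $x \mapsto x\log[(p-1)/x]$ is increasing on $[1, (p-1)/e]$, and, by Assumption 2.1, this interval contains $\bar R_q^{2/(2-q)}$. If $p$ is large enough so that $p - 1 \ge \exp\{(4/c)\log 2\}$, then

$$\log|\Theta_\epsilon| \ge c\log(p-1) \ge 4\log 2.$$

Thus, eqs. (8) and (9) are satisfied, and we conclude that

$$\max_{\theta \in \Theta_\epsilon} \mathbb{E}_\theta\|\hat\theta_1\hat\theta_1^T - \theta\theta^T\|_F \ge \frac{1}{4\sqrt2}$$

as long as $C^2 \le c/16$ and $p - 1 \ge \exp\{(4/c)\log 2\}$.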
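Case 2: Now let us suppose that

$$1 > C^{2-q}\bar R_q\left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{1-\frac q2}. \qquad (12)$$

Then

$$\left(\frac{\bar R_q}{\epsilon^q}\right)^{\frac{2}{2-q}} = C^{-q}\bar R_q\left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{-\frac q2}, \qquad (13)$$

and it is straightforward to check that Assumption 2.1 implies that if $C^q \ge \kappa^q$, then there is $\alpha \in (0,1]$, depending only on $q$, such that

$$\left(\frac{1}{\epsilon^q}\right)^{\frac{2}{2-q}} \le \left(\frac{p-1}{\bar R_q^{2/(2-q)}}\right)^{1-\alpha}. \qquad (14)$$

So by Lemma 3.1.2,

$$\log|\Theta_\epsilon| \ge c\left(\frac{\bar R_q}{\epsilon^q}\right)^{\frac{2}{2-q}}\left[\log\frac{p-1}{\bar R_q^{2/(2-q)}} - \log\left(\frac{1}{\epsilon^q}\right)^{\frac{2}{2-q}}\right] \ge c\alpha\,C^{-q}\bar R_q\left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{-\frac q2}\log\frac{p-1}{\bar R_q^{2/(2-q)}}, \qquad (15)$$

where the last inequality is obtained by plugging in (13) and (14).

If we choose $C^2 \le c\alpha/16$, then combining (10) and (15), we have

$$\frac{2n\epsilon^2/\sigma^2}{\log|\Theta_\epsilon|} \le \frac{4C^2}{c\alpha} \le \frac14 \qquad (16)$$

and eq. (8) is satisfied. On the other hand, by (12) and the fact that $\bar R_q \ge 1$, we have

$$C^{-q}\left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{-\frac q2} \ge 1$$

and hence (15) becomes

$$\log|\Theta_\epsilon| \ge c\alpha\,\bar R_q\log\frac{p-1}{\bar R_q^{2/(2-q)}}. \qquad (17)$$

The function $x \mapsto x\log[(p-1)/x^{2/(2-q)}]$ is increasing on $[1, (p-1)^{1-q/2}/e]$ and, by Assumption 2.1, $1 \le \bar R_q \le (p-1)^{1-q/2}/e$. If $p - 1 \ge \exp\{[4/(c\alpha)]\log 2\}$, then

$$\log|\Theta_\epsilon| \ge c\alpha\log(p-1) \ge 4\log 2$$

and eq. (9) is satisfied. So we can conclude that

$$\max_{\theta \in \Theta_\epsilon} \mathbb{E}_\theta\|\hat\theta_1\hat\theta_1^T - \theta\theta^T\|_F \ge \frac{\epsilon}{4\sqrt2}$$

as long as $C^2 \le c\alpha/16$ and $p - 1 \ge \exp\{[4/(c\alpha)]\log 2\}$.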

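Cases 1 and 2 together: Looking back at cases 1 and 2, we see that because $\alpha \le 1$, the conditions that $\kappa^2 \le C^2 \le c\alpha/16$ and $p - 1 \ge \exp\{[4/(c\alpha)]\log 2\}$ are sufficient to ensure that

$$\max_{\theta \in \Theta_\epsilon} \mathbb{E}_\theta\|\hat\theta_1\hat\theta_1^T - \theta\theta^T\|_F \ge c'\min\left\{1,\ \bar R_q^{1/2}\left[\frac{\sigma^2}{n}\log\frac{p-1}{\bar R_q^{2/(2-q)}}\right]^{\frac12 - \frac q4}\right\}$$

for a constant $c' > 0$ depending only on $q$. ■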

3.2 Proof of the Upper Bound (Theorem 2.2) 

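We begin with a lemma that bounds the curvature of the matrix functional $\langle\Sigma, bb^T\rangle$.

Lemma 3.2.1. Let $\theta \in \mathbb{S}_2^{p-1}$. If $\Sigma \succeq 0$ has a unique largest eigenvalue $\lambda_1$ with corresponding eigenvector $\theta_1$, then

$$\tfrac12(\lambda_1 - \lambda_2)\|\theta\theta^T - \theta_1\theta_1^T\|_F^2 \le \langle\Sigma, \theta_1\theta_1^T - \theta\theta^T\rangle.$$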
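The curvature inequality can be sanity-checked numerically. The sketch below is an illustration under assumptions (dimension, eigenvalues and the random construction are arbitrary choices): it builds a covariance matrix of the form (3) and evaluates the gap between the two sides of Lemma 3.2.1 at random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
p, lam1, lam2 = 8, 3.0, 1.0

# Build Sigma = lam1 * theta1 theta1^T + lam2 * Sigma0 with Sigma0 >= 0,
# Sigma0 theta1 = 0 and spectral norm 1, following model (3).
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))          # random orthonormal basis
theta1 = Q[:, 0]
spectrum0 = np.concatenate(([1.0], rng.uniform(0.0, 1.0, size=p - 2)))
Sigma0 = Q[:, 1:] @ np.diag(spectrum0) @ Q[:, 1:].T
Sigma = lam1 * np.outer(theta1, theta1) + lam2 * Sigma0

def curvature_gap(theta):
    """Right side minus left side of the inequality in Lemma 3.2.1 (expected >= 0)."""
    diff = np.outer(theta1, theta1) - np.outer(theta, theta)
    lhs = 0.5 * (lam1 - lam2) * np.linalg.norm(diff, "fro") ** 2
    rhs = np.sum(Sigma * diff)                        # <Sigma, diff> = Tr(Sigma^T diff)
    return rhs - lhs

gaps = []
for _ in range(200):
    v = rng.normal(size=p)
    gaps.append(curvature_gap(v / np.linalg.norm(v)))
```

Every gap is nonnegative up to rounding, and the gap is zero at $\theta = \theta_1$, where both sides vanish.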

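Now consider $\hat\theta_1$, the $\ell_q$-constrained sparse PCA estimator of $\theta_1$. Let $\epsilon = \|\hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\|_F$. Since $\hat\theta_1 \in \mathbb{S}_2^{p-1}$, it follows from Lemma 3.2.1 that

$$\begin{aligned}
(\lambda_1 - \lambda_2)\epsilon^2/2 &\le \langle\Sigma, \theta_1\theta_1^T - \hat\theta_1\hat\theta_1^T\rangle \\
&= \langle S, \theta_1\theta_1^T\rangle - \langle\Sigma, \hat\theta_1\hat\theta_1^T\rangle - \langle S - \Sigma, \theta_1\theta_1^T\rangle \\
&\le \langle S - \Sigma, \hat\theta_1\hat\theta_1^T\rangle - \langle S - \Sigma, \theta_1\theta_1^T\rangle \\
&= \langle S - \Sigma, \hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\rangle. \qquad (18)
\end{aligned}$$

We consider the cases $q \in (0,1)$, $q = 1$, and $q = 0$ separately.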

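3.2.1 Case 1: $q \in (0,1)$

By applying Hölder's Inequality to the right side of eq. (18) and rearranging, we have

$$\epsilon^2/2 \le \frac{\|\operatorname{vec}(S - \Sigma)\|_\infty}{\lambda_1 - \lambda_2}\,\|\operatorname{vec}(\theta_1\theta_1^T - \hat\theta_1\hat\theta_1^T)\|_1, \qquad (19)$$

where $\operatorname{vec}(A)$ denotes the $1 \times p^2$ matrix obtained by stacking the columns of a $p \times p$ matrix $A$. Since $\theta_1$ and $\hat\theta_1$ both belong to $\mathbb{B}_q^p(R_q)$,

$$\|\operatorname{vec}(\theta_1\theta_1^T - \hat\theta_1\hat\theta_1^T)\|_q^q \le \|\operatorname{vec}(\theta_1\theta_1^T)\|_q^q + \|\operatorname{vec}(\hat\theta_1\hat\theta_1^T)\|_q^q \le 2R_q^2.$$

Let $t > 0$. We can use a standard truncation argument [see, e.g., 20, Lemma 5] to show that

$$\begin{aligned}
\|\operatorname{vec}(\theta_1\theta_1^T - \hat\theta_1\hat\theta_1^T)\|_1 &\le \sqrt2 R_q\|\operatorname{vec}(\theta_1\theta_1^T - \hat\theta_1\hat\theta_1^T)\|_2\,t^{-q/2} + 2R_q^2t^{1-q} \\
&= \sqrt2 R_q\|\theta_1\theta_1^T - \hat\theta_1\hat\theta_1^T\|_F\,t^{-q/2} + 2R_q^2t^{1-q} \\
&= \sqrt2 R_q\,\epsilon\,t^{-q/2} + 2R_q^2t^{1-q}.
\end{aligned}$$

Letting $t = \|\operatorname{vec}(S - \Sigma)\|_\infty/(\lambda_1 - \lambda_2)$ and joining with eq. (19) gives us

$$\epsilon^2/2 \le \sqrt2\,t^{1-q/2}R_q\,\epsilon + 2t^{2-q}R_q^2.$$

If we define $m$ implicitly so that $\epsilon = m\sqrt2\,t^{1-q/2}R_q$, then the preceding inequality reduces to $m^2/2 \le m + 1$. If $m \ge 3$, then this is violated. So we must have $m < 3$ and hence

$$\epsilon \le 3\sqrt2\,t^{1-q/2}R_q = 3\sqrt2 R_q\left(\frac{\|\operatorname{vec}(S - \Sigma)\|_\infty}{\lambda_1 - \lambda_2}\right)^{1-q/2}. \qquad (20)$$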



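Combining the above discussion with the sub-Gaussian assumption, the next lemma allows us to bound $\|\operatorname{vec}(S - \Sigma)\|_\infty$.

Lemma 3.2.2. If Assumption 2.2 holds and $\Sigma$ satisfies (2), then there is an absolute constant $c > 0$ such that

$$\Big\|\,\|\operatorname{vec}(S - \Sigma)\|_\infty\Big\|_{\psi_1} \le cK^2\lambda_1\max\left\{\sqrt{\frac{\log p}{n}},\ \frac{\log p}{n}\right\}.$$

Applying Lemma 3.2.2 to eq. (20) gives

$$\big\|\epsilon^{2/(2-q)}\big\|_{\psi_1} \le c_1R_q^{2/(2-q)}\,\frac{\big\|\,\|\operatorname{vec}(S - \Sigma)\|_\infty\big\|_{\psi_1}}{\lambda_1 - \lambda_2} \le cK^2R_q^{2/(2-q)}\,\hat\sigma\max\left\{\sqrt{\frac{\log p}{n}},\ \frac{\log p}{n}\right\}.$$

The fact that $\mathbb{E}|X|^m \le m!\,\|X\|_{\psi_1}^m$ for $m \ge 1$ [see 24, Chapter 2.2] implies the following bound:

$$\mathbb{E}\epsilon^2 \le c_K R_q^2\,\hat\sigma^{2-q}\max\left\{\sqrt{\frac{\log p}{n}},\ \frac{\log p}{n}\right\}^{2-q} =: M.$$

Combining this with the trivial bound $\epsilon^2 \le 2$, yields

$$\mathbb{E}\epsilon^2 \le \min(2, M). \qquad (21)$$

If $\log p > n$, then $\mathbb{E}\epsilon^2 \le 2$. Otherwise, we need only consider the square root term inside $\max\{\}$ in the definition of $M$. Thus,

$$\mathbb{E}\epsilon^2 \le c\,\min\left\{1,\ R_q^2\left[\frac{\hat\sigma^2}{n}\log p\right]^{1-\frac q2}\right\}$$

for an appropriate constant $c > 0$, depending only on $K$. This completes the proof for the case $q \in (0,1)$.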

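3.2.2 Case 2: $q = 1$

$\hat\theta_1$ and $\theta_1$ both belong to $\mathbb{B}_1^p(R_1)$. So applying the triangle inequality to the right side of eq. (18) yields

$$\begin{aligned}
(\lambda_1 - \lambda_2)\epsilon^2/2 &\le \langle S - \Sigma, \hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\rangle \\
&\le |\hat\theta_1^T(S - \Sigma)\hat\theta_1| + |\theta_1^T(S - \Sigma)\theta_1| \\
&\le 2\sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_1^p(R_1)} |b^T(S - \Sigma)b|.
\end{aligned}$$

The next lemma provides a bound for the supremum.

Lemma 3.2.3. If Assumption 2.2 holds and $\Sigma$ satisfies (2), then there is an absolute constant $c > 0$ such that

$$\mathbb{E}\sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_1^p(R_1)} |b^T(S - \Sigma)b| \le c\lambda_1K^2\max\left\{R_1\sqrt{\frac{\log(p/R_1^2)}{n}},\ R_1^2\,\frac{\log(p/R_1^2)}{n}\right\}$$

for all $R_1^2 \in [1, p/e]$.

Assumption 2.1 guarantees that $R_1^2 \in [1, p/e]$. Thus, we can apply Lemma 3.2.3 and an argument similar to that used with (21) to complete the proof for the case $q = 1$.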

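3.2.3 Case 3: $q = 0$

We continue from eq. (18). Since $\hat\theta_1$ and $\theta_1$ belong to $\mathbb{B}_0^p(R_0)$, their difference belongs to $\mathbb{B}_0^p(2R_0)$. Let $\Pi$ denote the diagonal matrix whose diagonal entries are 1 wherever $\hat\theta_1$ or $\theta_1$ are nonzero, and 0 elsewhere. Then $\Pi$ has at most $2R_0$ nonzero diagonal entries, and $\Pi\hat\theta_1 = \hat\theta_1$ and $\Pi\theta_1 = \theta_1$. So by the Von Neumann trace inequality and Lemma A.1.1,

$$\begin{aligned}
(\lambda_1 - \lambda_2)\epsilon^2/2 &\le |\langle S - \Sigma, \Pi(\hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T)\Pi\rangle| \\
&= |\langle\Pi(S - \Sigma)\Pi, \hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\rangle| \\
&\le \|\Pi(S - \Sigma)\Pi\|_2\,\|\hat\theta_1\hat\theta_1^T - \theta_1\theta_1^T\|_{S_1} \\
&= \|\Pi(S - \Sigma)\Pi\|_2\,\sqrt2\,\epsilon \\
&\le \sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(2R_0)} |b^T(S - \Sigma)b|\,\sqrt2\,\epsilon,
\end{aligned}$$

where $\|\cdot\|_{S_1}$ denotes the sum of the singular values. Divide both sides by $\epsilon$, rearrange terms, and then take the expectation to get

$$\mathbb{E}\epsilon \le \frac{2\sqrt2}{\lambda_1 - \lambda_2}\,\mathbb{E}\sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(2R_0)} |b^T(S - \Sigma)b|.$$

Lemma 3.2.4. If Assumption 2.2 holds and $\Sigma$ satisfies (2), then there is an absolute constant $c > 0$ such that

$$\mathbb{E}\sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)} |b^T(S - \Sigma)b| \le cK^2\lambda_1\max\left\{\sqrt{(d/n)\log(p/d)},\ (d/n)\log(p/d)\right\}$$

for all integers $d \in [1, p/2)$.

Taking $d = 2R_0$ and applying an argument similar to that used with (21) completes the proof of the $q = 0$ case. ■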

4 Conclusion and Further Extensions 

We have presented upper and lower bounds on the
minimax estimation error for sparse PCA over $\ell_q$ balls.
The bounds are sharp in $(p, n)$, and they show that $\ell_q$
constraints on the leading eigenvector make estimation
possible in high-dimensions even when the number of
variables greatly exceeds the sample size. Although we
have specialized to the case $k = 1$ (for the leading
eigenvector), our methods and arguments can be extended
to the multi-dimensional subspace case ($k > 1$). One
nuance in that case is that there are different ways to
generalize the notion of $\ell_q$ sparsity to multiple
eigenvectors. A potential difficulty there is that if there
is multiplicity in the eigenvalues or if eigenvalues
coalesce, then the eigenvectors need not be unique
(up to sign). So care must be taken to handle
this possibility.

A APPENDIX - SUPPLEMENTARY MATERIAL

A.1 Additional Technical Tools

We state below two results that we use frequently in
our proofs. The first is a well-known consequence of the
CS decomposition. It relates the canonical angles be- 
tween subspaces to the singular values of products and 
differences of their corresponding projection matrices. 

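Lemma A.1.1 (Stewart and Sun [22, Theorem 1.5.5]). Let $\mathcal{X}$ and $\mathcal{Y}$ be $k$-dimensional subspaces of $\mathbb{R}^p$ with orthogonal projections $\Pi_\mathcal{X}$ and $\Pi_\mathcal{Y}$. Let $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k$ be the sines of the canonical angles between $\mathcal{X}$ and $\mathcal{Y}$. Then

1. The singular values of $\Pi_\mathcal{X}(I_p - \Pi_\mathcal{Y})$ are

$$\sigma_1, \sigma_2, \dots, \sigma_k, 0, \dots, 0.$$

2. The singular values of $\Pi_\mathcal{X} - \Pi_\mathcal{Y}$ are

$$\sigma_1, \sigma_1, \sigma_2, \sigma_2, \dots, \sigma_k, \sigma_k, 0, \dots, 0.$$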
Lemma A.1.2. Let $x, y \in \mathbb{S}_2^{p-1}$. Then

$$\|xx^T - yy^T\|_F^2 \le 2\|x - y\|_2^2.$$

If in addition $\|x - y\|_2 \le \sqrt2$, then

$$\|xx^T - yy^T\|_F^2 \ge \|x - y\|_2^2.$$

Proof. By Lemma A.1.1 and the polarization identity

$$\frac12\|xx^T - yy^T\|_F^2 = 1 - (x^Ty)^2 = 1 - \left(\frac{2 - \|x - y\|_2^2}{2}\right)^2 = \|x - y\|_2^2 - \|x - y\|_2^4/4.$$

The upper bound follows immediately. Now if $\|x - y\|_2^2 \le 2$, then the above right-hand side is bounded from below by $\|x - y\|_2^2/2$. ■

A. 2 Proofs for Theorem 2.1 

Proof of Lemma 3.1.2 Our construction is based 
on a hypercube argument. We require a variation of 
the Varshamov-Gilbcrt bound due to Birge and Mas- 
sart [4]. We use a specialization of the version that 
appears in [15, Lemma 4.10]. 

Lemma. Let d be an integer satisfying 1 < d < 
(p — l)/4. There exists a subset fid C {0, 1} P_1 that 
satisfies the following properties: 



1. \\lj\\o = d for all lu £ fid, 

2. \\bj — w'||o > d/2 for all distinct pairs cj,u)' £ fid. 
and 

3. log|f2 rf | > cdlog{{p - l)/d), where c > 0.233. 

Let d £ [1, (p — l)/4] be an integer, fid be the cor- 
responding subset of {0, 1} P_1 given by preceding 
lemma, 

x(u) = ((l-e 2 )i,eLjd~i) £M p , 

and 

9 = {x(lj) : ui £ fl d } ■ 
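Clearly, $\Theta$ satisfies the following properties:

1. $\Theta \subseteq \mathbb{S}_2^{p-1}$,

2. $\epsilon/\sqrt{2} < \|\theta_1 - \theta_2\|_2 \le \sqrt{2}\,\epsilon$ for all distinct pairs $\theta_1, \theta_2 \in \Theta$,

3. $\|\theta\|_q^q \le 1 + \epsilon^q d^{(2-q)/2}$ for all $\theta \in \Theta$, and

4. $\log|\Theta| \ge c\,d[\log(p-1) - \log d]$, where $c \ge 0.233$.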
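The first three properties can be illustrated on a small example. The sketch below builds $x(\omega)$ for two hand-picked 0/1 vectors (hypothetical choices, not the set supplied by the Varshamov-Gilbert lemma) and computes the quantities appearing in items 1 to 3; all names are assumptions made for illustration.

```python
import math

def embed(omega, eps):
    """Map a 0/1 vector omega with d ones to x(omega) in R^p, p = len(omega) + 1."""
    d = sum(omega)
    return [math.sqrt(1.0 - eps ** 2)] + [eps * w / math.sqrt(d) for w in omega]

def lq_power(v, q):
    """||v||_q^q for 0 < q <= 1, summing |v_i|^q over the nonzero entries."""
    return sum(abs(t) ** q for t in v if t != 0.0)

eps, q = 0.5, 0.5
omega1 = [1, 1, 1, 1, 0, 0, 0, 0]   # hypothetical example vectors with d = 4
omega2 = [1, 1, 0, 0, 1, 1, 0, 0]
d = sum(omega1)
x1, x2 = embed(omega1, eps), embed(omega2, eps)

norm_sq = sum(t * t for t in x1)                        # item 1: equals 1
hamming = sum(a != b for a, b in zip(omega1, omega2))   # ||omega1 - omega2||_0
dist_sq = sum((a - b) ** 2 for a, b in zip(x1, x2))     # eps^2 * hamming / d
lq_bound = 1.0 + eps ** q * d ** ((2.0 - q) / 2.0)      # right side of item 3
```

Here the squared distance equals $\epsilon^2\|\omega_1 - \omega_2\|_0/d$, which is how the Hamming separation of $\Omega_d$ translates into item 2.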

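To ensure that $\Theta$ is also contained in $\mathbb{B}_q^p(R_q)$, we will choose $d$ so that the right side of the upper bound in item 3 is smaller than $R_q$. Choose

\[ d = \left\lfloor \min\left\{ (p-1)/4,\ (\bar{R}_q/\epsilon^q)^{2/(2-q)} \right\} \right\rfloor , \]

with $\bar{R}_q = R_q - 1$. The assumptions that $p \ge 5$, $\epsilon \le 1$, and $\bar{R}_q \ge 1$ guarantee that this is a valid choice satisfying $d \in [1, (p-1)/4]$. The choice also guarantees that $\Theta \subset \mathbb{B}_q^p(R_q)$, because

\[ \|\theta\|_q^q \le 1 + \epsilon^q d^{(2-q)/2} \le 1 + \epsilon^q (\bar{R}_q/\epsilon^q) = R_q \]

for all $\theta \in \Theta$. To complete the proof we will show that $\log|\Theta|$ satisfies the lower bound claimed by the lemma. Note that the function $a \mapsto a\log[(p-1)/a]$ is increasing on $[0, (p-1)/e]$ and decreasing on $[(p-1)/e, \infty)$. So if

\[ a := (\bar{R}_q/\epsilon^q)^{2/(2-q)} \le \frac{p-1}{4} , \]

then

\[
\begin{aligned}
\log|\Theta| &\ge c\,d[\log(p-1) - \log d] \\
&\ge (c/2)\,a[\log(p-1) - \log a] ,
\end{aligned}
\]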
because $d = \lfloor a \rfloor \ge a/2$. Moreover, since $d \le (p-1)/4$ and the above right hand side is maximized when $a = (p-1)/e$, the inequality remains valid for all $a \ge 0$ if we replace the constant $(c/2)$ with the constant

\[ (c/2)\,\frac{\frac{p-1}{4}\left[\log(p-1) - \log\frac{p-1}{4}\right]}{\frac{p-1}{e}\left[\log(p-1) - \log\frac{p-1}{e}\right]} \ge 0.109 . \]
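The choice of $d$ and the numerical value of the final constant can be reproduced in a few lines. This is a sketch under the assumption $\bar{R}_q = R_q - 1$; the function name `choose_d` and the sample inputs are hypothetical.

```python
import math

def choose_d(p, eps, r_bar, q):
    """d = floor(min{(p-1)/4, (r_bar / eps^q)^(2/(2-q))}), with r_bar = R_q - 1."""
    a = (r_bar / eps ** q) ** (2.0 / (2.0 - q))
    return math.floor(min((p - 1) / 4.0, a))

c = 0.233
# Ratio of a*log((p-1)/a) at a = (p-1)/4 to its maximum at a = (p-1)/e.
# The factor (p-1) cancels, leaving (log 4 / 4) / (1 / e).
const = (c / 2.0) * (math.log(4.0) / 4.0) / (1.0 / math.e)

# Two illustrative regimes: the cap (p-1)/4 binds, then the sparsity term binds.
d_small_p = choose_d(101, 0.5, 4.0, 1.0)
d_large_p = choose_d(1001, 0.5, 4.0, 1.0)
```

With these inputs the constant evaluates to roughly 0.1098, consistent with the stated lower bound of 0.109.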



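Proof of Lemma 3.1.3. Let $A_i = x_i x_i^T$ for $i = 1, 2$. Then $\Sigma_i = \lambda_1 A_i + \lambda_2 (I_p - A_i)$. Since $\Sigma_1$ and $\Sigma_2$ have the same eigenvalues and hence the same determinant,

\[
\begin{aligned}
D(P_1 \| P_2)
&= \frac{n}{2}\left[\mathrm{Tr}(\Sigma_2^{-1}\Sigma_1) - p - \log\det(\Sigma_2^{-1}\Sigma_1)\right] \\
&= \frac{n}{2}\left[\mathrm{Tr}(\Sigma_2^{-1}\Sigma_1) - p\right] \\
&= \frac{n}{2}\,\mathrm{Tr}\left(\Sigma_2^{-1}(\Sigma_1 - \Sigma_2)\right) .
\end{aligned}
\]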

The spectral decomposition $\Sigma_2 = \lambda_1 A_2 + \lambda_2 (I_p - A_2)$ allows us to easily calculate that

\[ \Sigma_2^{-1} = \lambda_2^{-1}(I_p - A_2) + \lambda_1^{-1} A_2 . \]

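Since orthogonal projections are idempotent, i.e. $A_i A_i = A_i$,

\[
\begin{aligned}
\Sigma_2^{-1}(\Sigma_1 - \Sigma_2)
&= \frac{\lambda_1 - \lambda_2}{\lambda_1}\left[(\lambda_1/\lambda_2)(I_p - A_2) + A_2\right](A_1 - A_2) \\
&= \frac{\lambda_1 - \lambda_2}{\lambda_1}\left[(\lambda_1/\lambda_2)(I_p - A_2)A_1 - A_2(A_2 - A_1)\right] \\
&= \frac{\lambda_1 - \lambda_2}{\lambda_1}\left[(\lambda_1/\lambda_2)(I_p - A_2)A_1 - A_2(I_p - A_1)\right] .
\end{aligned}
\]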

Using again the idempotent property and symmetry of projection matrices,

\[
\begin{aligned}
\mathrm{Tr}((I_p - A_2)A_1)
&= \mathrm{Tr}((I_p - A_2)(I_p - A_2)A_1 A_1) \\
&= \mathrm{Tr}(A_1 (I_p - A_2)(I_p - A_2) A_1) \\
&= \|A_1 (I_p - A_2)\|_F^2
\end{aligned}
\]

and similarly,

\[ \mathrm{Tr}(A_2 (I_p - A_1)) = \|A_2 (I_p - A_1)\|_F^2 . \]

By Lemma A.1.1,

\[ \|A_1 (I_p - A_2)\|_F^2 = \|A_2 (I_p - A_1)\|_F^2 = \tfrac{1}{2}\|A_1 - A_2\|_F^2 . \]



Thus,

\[ \mathrm{Tr}\left(\Sigma_2^{-1}(\Sigma_1 - \Sigma_2)\right) = \frac{(\lambda_1 - \lambda_2)^2}{2\lambda_1\lambda_2}\,\|A_1 - A_2\|_F^2 \]

and the result follows.
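The closed form of the inverse and the resulting trace identity can be verified directly on a small example. The sketch below uses plain nested lists; the dimension, eigenvalues, and the two unit vectors are arbitrary illustrative choices, and the helper names are assumptions.

```python
import math

def outer(x):
    """Rank-one projection x x^T for a unit vector x."""
    return [[a * b for b in x] for a in x]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lincomb(a, A, b, B):
    """Entrywise a*A + b*B."""
    return [[a * s + b * t for s, t in zip(ra, rb)] for ra, rb in zip(A, B)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

p, lam1, lam2 = 4, 3.0, 1.0
I = [[float(i == j) for j in range(p)] for i in range(p)]
x1 = [1.0, 0.0, 0.0, 0.0]
x2 = [math.cos(0.3), math.sin(0.3), 0.0, 0.0]   # arbitrary unit vectors
A1, A2 = outer(x1), outer(x2)

# Sigma_i = lam1 * A_i + lam2 * (I - A_i), and the claimed inverse of Sigma_2.
sigma1 = lincomb(lam1, A1, lam2, lincomb(1.0, I, -1.0, A1))
sigma2 = lincomb(lam1, A2, lam2, lincomb(1.0, I, -1.0, A2))
sigma2_inv = lincomb(1.0 / lam2, lincomb(1.0, I, -1.0, A2), 1.0 / lam1, A2)

lhs = trace(matmul(sigma2_inv, lincomb(1.0, sigma1, -1.0, sigma2)))
diff = lincomb(1.0, A1, -1.0, A2)
frob_sq = sum(v * v for row in diff for v in row)
rhs = (lam1 - lam2) ** 2 / (2.0 * lam1 * lam2) * frob_sq
```

Both sides agree up to floating-point error, and multiplying `sigma2` by `sigma2_inv` recovers the identity matrix.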



A.3 Proofs for Theorem 2.2

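Proof of Lemma 3.2.1. We begin with the expansion,

\[
\begin{aligned}
\langle \Sigma, \theta_1\theta_1^T - \theta\theta^T \rangle
&= \mathrm{Tr}\{\Sigma\theta_1\theta_1^T\} - \mathrm{Tr}\{\Sigma\theta\theta^T\} \\
&= \mathrm{Tr}\{\Sigma(I_p - \theta\theta^T)\theta_1\theta_1^T\} - \mathrm{Tr}\{\Sigma\theta\theta^T(I_p - \theta_1\theta_1^T)\} .
\end{aligned}
\]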

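Since $\theta_1$ is an eigenvector of $\Sigma$ corresponding to the eigenvalue $\lambda_1$,

\[
\begin{aligned}
\mathrm{Tr}\{\Sigma(I_p - \theta\theta^T)\theta_1\theta_1^T\}
&= \mathrm{Tr}\{\theta_1\theta_1^T \Sigma (I_p - \theta\theta^T)\theta_1\theta_1^T\} \\
&= \lambda_1 \mathrm{Tr}\{\theta_1\theta_1^T (I_p - \theta\theta^T)\theta_1\theta_1^T\} \\
&= \lambda_1 \mathrm{Tr}\{\theta_1\theta_1^T (I_p - \theta\theta^T)^2 \theta_1\theta_1^T\} \\
&= \lambda_1 \|\theta_1\theta_1^T (I_p - \theta\theta^T)\|_F^2 .
\end{aligned}
\]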

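Similarly, we have

\[
\begin{aligned}
\mathrm{Tr}\{\Sigma\theta\theta^T(I_p - \theta_1\theta_1^T)\}
&= \mathrm{Tr}\{(I_p - \theta_1\theta_1^T)\Sigma\theta\theta^T(I_p - \theta_1\theta_1^T)\} \\
&= \mathrm{Tr}\{\theta^T(I_p - \theta_1\theta_1^T)\Sigma(I_p - \theta_1\theta_1^T)\theta\} \\
&\le \lambda_2 \mathrm{Tr}\{\theta^T(I_p - \theta_1\theta_1^T)^2\theta\} \\
&= \lambda_2 \|\theta\theta^T(I_p - \theta_1\theta_1^T)\|_F^2 .
\end{aligned}
\]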

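Thus,

\[
\begin{aligned}
\langle \Sigma, \theta_1\theta_1^T - \theta\theta^T \rangle
&\ge (\lambda_1 - \lambda_2)\|\theta\theta^T(I_p - \theta_1\theta_1^T)\|_F^2 \\
&= \tfrac{1}{2}(\lambda_1 - \lambda_2)\|\theta\theta^T - \theta_1\theta_1^T\|_F^2 .
\end{aligned}
\]

The last inequality follows from Lemma A.1.1. ■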
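The inequality of Lemma 3.2.1 can be checked numerically in the simplest setting. The sketch below assumes a diagonal $\Sigma$ (so that $\theta_1$ is the first coordinate vector), which is an illustrative special case rather than the general statement; names and default arguments are hypothetical.

```python
import math
import random

def check_curvature(eigs, trials=500, seed=1):
    """eigs: eigenvalues of a diagonal Sigma, sorted so eigs[0] > eigs[1] >= ..."""
    rng = random.Random(seed)
    p = len(eigs)
    lam1, lam2 = eigs[0], eigs[1]
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        nrm = math.sqrt(sum(t * t for t in v))
        theta = [t / nrm for t in v]
        # <Sigma, theta1 theta1^T - theta theta^T> with theta1 = e_1.
        gap = lam1 - sum(l * t * t for l, t in zip(eigs, theta))
        # ||theta theta^T - theta1 theta1^T||_F^2 = 2 - 2 <theta, theta1>^2.
        frob_sq = 2.0 - 2.0 * theta[0] ** 2
        assert 0.5 * (lam1 - lam2) * frob_sq <= gap + 1e-12
    return True
```

In this diagonal case the gap equals $\sum_{i \ge 2}(\lambda_1 - \lambda_i)\theta_i^2$, which makes the role of the eigengap $\lambda_1 - \lambda_2$ explicit.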

Proof of Lemma 3.2.2. Since the distribution of $S - \Sigma$ does not depend on $\mu = \mathbb{E}X_i$, we assume without loss of generality that $\mu = 0$. Let $a, b \in \{1, \ldots, p\}$ and

\[
D_{ab} = \frac{1}{n}\sum_{i=1}^n (X_i)_a (X_i)_b - \Sigma_{ab} ,
\qquad
\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i .
\]

Then

\[ (S - \Sigma)_{ab} = D_{ab} - \bar{X}_a \bar{X}_b . \]

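Using the elementary inequality $2|ab| \le a^2 + b^2$, we have by Assumption 2.2 that

\[
\begin{aligned}
&\|(X_i)_a (X_i)_b\|_{\psi_1} \\
&\quad\le \max_a \big\| |\langle X_i, 1_a \rangle|^2 \big\|_{\psi_1} \\
&\quad\le 2\max_a \|\langle \Sigma^{1/2} Z_i, 1_a \rangle\|_{\psi_2}^2 \\
&\quad\le 2\lambda_1 K^2 .
\end{aligned}
\]

In the third line, we used the fact that the $\psi_1$-norm is bounded above by a constant times the $\psi_2$-norm [see 24, p. 95]. By a generalization of Bernstein's Inequality for the $\psi_1$-norm [see 24, Section 2.2], for all $t > 0$

\[
\begin{aligned}
\mathbb{P}(|D_{ab}| > 8t\lambda_1 K^2)
&\le \mathbb{P}\left(|D_{ab}| > 4t\|(X_i)_a (X_i)_b\|_{\psi_1}\right) \\
&\le 2\exp(-n\min\{t, t^2\}/2) .
\end{aligned}
\]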

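This implies [24, Lemma 2.2.10] the bound

\[ \Big\| \max_{ab} |D_{ab}| \Big\|_{\psi_1} \le cK^2\lambda_1 \max\left\{ \sqrt{\frac{\log p}{n}},\ \frac{\log p}{n} \right\} . \tag{22} \]

Similarly,

\[
\begin{aligned}
2\|\bar{X}_a \bar{X}_b\|_{\psi_1}
&\le \big\| |\langle \bar{X}, 1_a \rangle|^2 \big\|_{\psi_1} + \big\| |\langle \bar{X}, 1_b \rangle|^2 \big\|_{\psi_1} \\
&\le \|\langle \bar{X}, 1_a \rangle\|_{\psi_2}^2 + \|\langle \bar{X}, 1_b \rangle\|_{\psi_2}^2 \\
&\le \frac{c}{n^2}\sum_{i=1}^n \left( \|\langle X_i, 1_a \rangle\|_{\psi_2}^2 + \|\langle X_i, 1_b \rangle\|_{\psi_2}^2 \right) .
\end{aligned}
\]

So by a union bound [24, Lemma 2.2.2],

\[ \Big\| \max_{ab} |\bar{X}_a \bar{X}_b| \Big\|_{\psi_1} \le cK^2\lambda_1 \frac{\log p}{n} . \tag{23} \]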

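Adding eqs. (22) and (23) and then adjusting the constant $c$ gives the desired result, because

\[
\big\| \|\mathrm{vec}(S - \Sigma)\|_\infty \big\|_{\psi_1}
\le \Big\| \max_{ab} |D_{ab}| \Big\|_{\psi_1} + \Big\| \max_{ab} |\bar{X}_a \bar{X}_b| \Big\|_{\psi_1} .
\]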



Proof of Lemma 3.2.3. Let $B = \mathbb{S}_2^{p-1} \cap \mathbb{B}_1^p(R_1)$. We will use a recent result in empirical process theory due to Mendelson [16] to bound

\[ \sup_{b \in B} b^T (S - \Sigma) b . \]

The result uses Talagrand's generic chaining method, and allows us to reduce the problem to bounding the supremum of a Gaussian process. The statement of the result involves the generic chaining complexity, $\gamma_2(B, d)$, of a set $B$ equipped with the metric $d$. We only use a special case, $\gamma_2(B, \|\cdot\|_2)$, where the complexity measure is equivalent to the expectation of the supremum of a Gaussian process on $B$. We refer the reader to [23] for a complete introduction.

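Lemma A.3.1 (Mendelson [16]). Let $Z_i$, $i = 1, \ldots, n$ be i.i.d. random variables. There exists an absolute constant $c$ for which the following holds. If $\mathcal{F}$ is a symmetric class of mean-zero functions then

\[
\mathbb{E}\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f^2(Z_i) - \mathbb{E}f^2(Z_i) \right|
\le c\max\left\{ d_{\psi_1}\frac{\gamma_2(\mathcal{F}, \psi_2)}{\sqrt{n}},\ \frac{\gamma_2^2(\mathcal{F}, \psi_2)}{n} \right\} ,
\]

where $d_{\psi_1} = \sup_{f \in \mathcal{F}} \|f\|_{\psi_1}$.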

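Since the distribution of $S - \Sigma$ does not depend on $\mu = \mathbb{E}X_i$, we assume without loss of generality that $\mu = 0$. Then $|b^T(S - \Sigma)b|$ is bounded from above by a sum of two terms,

\[ \left| b^T\left( \frac{1}{n}\sum_{i=1}^n X_i X_i^T - \Sigma \right) b \right| + b^T \bar{X}\bar{X}^T b , \]

which can be rewritten as

\[ D_1(b) := \left| \frac{1}{n}\sum_{i=1}^n \langle Z_i, \Sigma^{1/2} b \rangle^2 - \mathbb{E}\langle Z_i, \Sigma^{1/2} b \rangle^2 \right| \]

and $D_2(b) := \langle \bar{Z}, \Sigma^{1/2} b \rangle^2$, respectively. To apply Lemma A.3.1 to $D_1$, define the class of linear functionals

\[ \mathcal{F} := \{ \langle \cdot, \Sigma^{1/2} b \rangle : b \in B \} . \]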

Then

\[ \sup_{b \in B} D_1(b) = \sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f^2(Z_i) - \mathbb{E}f^2(Z_i) \right| \]

and we are in the setting of Lemma A.3.1.
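First, we bound the $\psi_1$-diameter of $\mathcal{F}$.

\[
\begin{aligned}
d_{\psi_1} &= \sup_{b \in B} \|\langle Z_i, \Sigma^{1/2} b \rangle\|_{\psi_1} \\
&\le c \sup_{b \in B} \|\langle Z_i, \Sigma^{1/2} b \rangle\|_{\psi_2} .
\end{aligned}
\]

By Assumption 2.2,

\[ \|\langle Z_i, \Sigma^{1/2} b \rangle\|_{\psi_2} \le K\|\Sigma^{1/2} b\|_2 \le K\lambda_1^{1/2} \]

and so

\[ d_{\psi_1} \le cK\lambda_1^{1/2} . \tag{24} \]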

Next, we bound $\gamma_2(\mathcal{F}, \psi_2)$ by showing that the metric induced by the $\psi_2$-norm on $\mathcal{F}$ is equivalent to the Euclidean metric on $B$. This will allow us to reduce the problem to bounding the supremum of a Gaussian process. For any $f, g \in \mathcal{F}$, by Assumption 2.2,

\[
\begin{aligned}
\|(f - g)(Z_i)\|_{\psi_2}
&= \|\langle Z_i, \Sigma^{1/2}(b_f - b_g) \rangle\|_{\psi_2} \\
&\le K\|\Sigma^{1/2}(b_f - b_g)\|_2 \\
&\le K\lambda_1^{1/2}\|b_f - b_g\|_2 ,
\end{aligned}
\tag{25}
\]

where $b_f, b_g \in B$. Thus, by [23, Theorem 1.3.6],

\[ \gamma_2(\mathcal{F}, \psi_2) \le cK\lambda_1^{1/2}\gamma_2(B, \|\cdot\|_2) . \]

Then applying Talagrand's Majorizing Measure Theorem [23, Theorem 2.1.1] yields

\[ \gamma_2(\mathcal{F}, \psi_2) \le cK\lambda_1^{1/2}\,\mathbb{E}\sup_{b \in B} \langle Y, b \rangle , \tag{26} \]

where $Y$ is a $p$-dimensional standard Gaussian random vector. Recall that $B = \mathbb{B}_1^p(R_1) \cap \mathbb{S}_2^{p-1}$. So

\[ \mathbb{E}\sup_{b \in B} \langle Y, b \rangle \le \mathbb{E}\sup_{b \in \mathbb{B}_1^p(R_1) \cap \mathbb{B}_2^p(1)} \langle Y, b \rangle . \]

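Here, we could easily upper bound the above quantity by the supremum over $\mathbb{B}_1^p(R_1)$ alone. Instead, we use a sharper upper bound due to Gordon et al. [8, Theorem 5.1]:

\[
\begin{aligned}
\mathbb{E}\sup_{b \in \mathbb{B}_1^p(R_1) \cap \mathbb{B}_2^p(1)} \langle Y, b \rangle
&\le R_1\sqrt{2 + \log(2p/R_1^2)} \\
&\le 2R_1\sqrt{\log(p/R_1^2)} ,
\end{aligned}
\]

where we used the assumption that $R_1^2 \le p/e$ in the last inequality. Now we apply Lemma A.3.1 to get

\[
\mathbb{E}\sup_{b \in B} D_1(b)
\le cK^2\lambda_1 \max\left\{ R_1\sqrt{\frac{\log(p/R_1^2)}{n}},\ R_1^2\frac{\log(p/R_1^2)}{n} \right\} .
\]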
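The simplification of the Gordon et al. bound relies only on the assumption $R_1^2 \le p/e$, and it can be checked on a grid of radii. The sketch below is illustrative; the function names and the grid resolution are assumptions, and it tests the elementary inequality between the two closed-form expressions rather than the Gaussian-width bound itself.

```python
import math

def gordon_bound(p, r1):
    """R_1 * sqrt(2 + log(2p / R_1^2)), the first bound in the display."""
    return r1 * math.sqrt(2.0 + math.log(2.0 * p / r1 ** 2))

def simplified_bound(p, r1):
    """2 * R_1 * sqrt(log(p / R_1^2)), the simplified bound."""
    return 2.0 * r1 * math.sqrt(math.log(p / r1 ** 2))

def holds_on_grid(p, steps=1000):
    """Check gordon_bound <= simplified_bound for radii with R_1^2 <= p/e."""
    r_max = math.sqrt(p / math.e)
    for k in range(1, steps + 1):
        r1 = r_max * k / steps
        if gordon_bound(p, r1) > simplified_bound(p, r1) + 1e-12:
            return False
    return True
```

The inequality reduces to $2 + \log 2 \le 3\log(p/R_1^2)$, which holds whenever $\log(p/R_1^2) \ge 1$.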



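Turning to $D_2(b)$, we can take $n = 1$ in Lemma A.3.1 and use a similar argument as above, because

\[ D_2(b) \le \left| \langle \bar{Z}, \Sigma^{1/2} b \rangle^2 - \mathbb{E}\langle \bar{Z}, \Sigma^{1/2} b \rangle^2 \right| + \mathbb{E}\langle \bar{Z}, \Sigma^{1/2} b \rangle^2 . \]

We just need to bound the $\psi_2$-norms of $f(\bar{Z})$ and $(f - g)(\bar{Z})$ to get bounds that are analogous to eqs. (24) and (25). Since $\bar{Z}$ is the sum of the independent random variables $Z_i/n$,

\[
\begin{aligned}
\sup_{b \in B} \|f(\bar{Z})\|_{\psi_2}^2
&= \sup_{b \in B} \|\langle \bar{Z}, \Sigma^{1/2} b_f \rangle\|_{\psi_2}^2 \\
&\le \sup_{b \in B}\, c\sum_{i=1}^n \|\langle Z_i, \Sigma^{1/2} b_f \rangle\|_{\psi_2}^2 / n^2 \\
&\le \sup_{b \in B}\, cK^2\lambda_1 \|b_f\|_2^2 / n \\
&\le cK^2\lambda_1 / n ,
\end{aligned}
\]

and similarly,

\[ \|(f - g)(\bar{Z})\|_{\psi_2}^2 \le cK^2\lambda_1 \|b_f - b_g\|_2^2 / n . \]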



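So repeating the same arguments as for $D_1$, we get a similar bound for $D_2$. Finally, we bound $\mathbb{E}D_2(b)$ by

\[
\begin{aligned}
\mathbb{E}\langle \bar{X}, b \rangle^2
&= b^T\Big( \sum_{i=1}^n \sum_{j=1}^n \mathbb{E}X_i X_j^T / n^2 \Big) b \\
&= b^T\Big( \sum_{i=1}^n \mathbb{E}X_i X_i^T / n^2 \Big) b \\
&= \|\Sigma^{1/2} b\|_2^2 / n \\
&\le \lambda_1 / n .
\end{aligned}
\]

Putting together the bounds for $D_1$ and $D_2$ and then adjusting constants completes the proof. ■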

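Proof of Lemma 3.2.4. Using a similar argument as in the proof of Lemma 3.2.3 we can show that

\[
\mathbb{E}\sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)} |b^T(S - \Sigma)b|
\le cK^2\lambda_1 \max\left\{ \frac{A}{\sqrt{n}},\ \frac{A^2}{n} \right\} ,
\]

where

\[ A = \mathbb{E}\sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)} \langle Y, b \rangle \]

and $Y$ is a $p$-dimensional standard Gaussian random vector. Thus we can reduce the problem to bounding the supremum of a Gaussian process.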

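Let $\mathcal{N} \subset \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)$ be a minimal $\delta$-covering of $\mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)$ in the Euclidean metric with the property that for each $x \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)$ there exists $y \in \mathcal{N}$ satisfying $\|x - y\|_2 \le \delta$ and $x - y \in \mathbb{B}_0^p(d)$. (We will show later that such a covering exists.)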

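Let $b^* \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)$ satisfy

\[ \sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)} \langle Y, b \rangle = \langle Y, b^* \rangle . \]

Then there is $\tilde{b} \in \mathcal{N}$ such that $\|b^* - \tilde{b}\|_2 \le \delta$ and $b^* - \tilde{b} \in \mathbb{B}_0^p(d)$. Since $(b^* - \tilde{b})/\|b^* - \tilde{b}\|_2 \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)$,

\[
\begin{aligned}
\langle Y, b^* \rangle
&= \langle Y, b^* - \tilde{b} \rangle + \langle Y, \tilde{b} \rangle \\
&\le \delta \sup_{u \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)} \langle Y, u \rangle + \langle Y, \tilde{b} \rangle \\
&\le \delta \langle Y, b^* \rangle + \max_{b \in \mathcal{N}} \langle Y, b \rangle .
\end{aligned}
\]

Thus,

\[ \sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)} \langle Y, b \rangle \le (1 - \delta)^{-1} \max_{b \in \mathcal{N}} \langle Y, b \rangle . \]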



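Since $\langle Y, b \rangle$ is a standard Gaussian for every $b \in \mathcal{N}$, a union bound [24, Lemma 2.2.2] implies

\[ \mathbb{E}\max_{b \in \mathcal{N}} \langle Y, b \rangle \le c\sqrt{\log|\mathcal{N}|} \]

for an absolute constant $c > 0$. Thus,

\[ \mathbb{E}\sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)} \langle Y, b \rangle \le c(1 - \delta)^{-1}\sqrt{\log|\mathcal{N}|} . \]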

Finally, we will bound $\log|\mathcal{N}|$ by constructing a $\delta$-covering set and then choosing $\delta$. It is well known that the minimal $\delta$-covering of $\mathbb{S}_2^{d-1}$ in the Euclidean metric has cardinality at most $(1 + 2/\delta)^d$. Associate with each subset $I \subseteq \{1, \ldots, p\}$ of size $d$, a minimal $\delta$-covering of the corresponding isometric copy of $\mathbb{S}_2^{d-1}$. This set covers every possible subset of size $d$, so for each $x \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)$ there is $y \in \mathcal{N}$ satisfying $\|x - y\|_2 \le \delta$ and $x - y \in \mathbb{B}_0^p(d)$. Since there are ($p$ choose $d$) possible subsets,

\[
\begin{aligned}
\log|\mathcal{N}|
&\le \log\binom{p}{d} + d\log(1 + 2/\delta) \\
&\le \log\left(\frac{ep}{d}\right)^d + d\log(1 + 2/\delta) \\
&= d + d\log(p/d) + d\log(1 + 2/\delta) .
\end{aligned}
\]

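In the second line, we used the binomial coefficient bound $\binom{p}{d} \le (ep/d)^d$. If we take $\delta = 1/4$, then

\[
\begin{aligned}
\log|\mathcal{N}| &\le d + d\log(p/d) + d\log 9 \\
&\le c\,d\log(p/d) ,
\end{aligned}
\]

where we used the assumption that $d < p/2$. Thus,

\[ A = \mathbb{E}\sup_{b \in \mathbb{S}_2^{p-1} \cap \mathbb{B}_0^p(d)} \langle Y, b \rangle \le c\sqrt{d\log(p/d)} \]

for all $d \in [1, p/2)$.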
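The counting step can be made concrete with an explicit constant. The sketch below compares the first line of the covering bound (with $\delta = 1/4$) to $c\,d\log(p/d)$ for one admissible, non-optimized choice $c = 1 + (1 + \log 9)/\log 2$, which works because $\log(p/d) > \log 2$ when $d < p/2$. The function names and this value of $c$ are assumptions for illustration, not constants taken from the paper.

```python
import math

def log_cover_bound(p, d, delta=0.25):
    """log C(p, d) + d * log(1 + 2/delta): the first line of the display."""
    return math.log(math.comb(p, d)) + d * math.log(1.0 + 2.0 / delta)

def simple_bound(p, d):
    """c * d * log(p/d) with one admissible (not optimized) constant c."""
    c = 1.0 + (1.0 + math.log(9.0)) / math.log(2.0)
    return c * d * math.log(p / d)

def holds_for(p):
    """Check the bound for every integer d with 1 <= d < p/2."""
    return all(log_cover_bound(p, d) <= simple_bound(p, d)
               for d in range(1, p) if d < p / 2)
```

Taking square roots of both sides then gives a bound on $A$ of order $\sqrt{d\log(p/d)}$, as used in the last display.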



